Abstract

This paper proposes a Japanese/English cross-
language information retrieval (CLIR) system
targeting technical documents. Our system 
first translates a given query containing tech- 
nical terms into the target language, and then 
retrieves documents relevant to the translated 
query. The translation of technical terms is still 
problematic in that technical terms are often 
compound words, and thus new terms can be 
progressively created simply by combining
existing base words. In addition, Japanese often
represents loanwords based on its phonogram.
Consequently, existing dictionaries find
it difficult to achieve sufficient coverage. To 
counter the first problem, we use a compound 
word translation method, which uses a bilin- 
gual dictionary for base words and collocational 
statistics to resolve translation ambiguity. For 
the second problem, we propose a translitera- 
tion method, which identifies phonetic equiva- 
lents in the target language. We also show the 
effectiveness of our system using a test collec- 
tion for CLIR. 

1 Introduction 

Cross-language information retrieval (CLIR), 
where the user presents queries in one language 
to retrieve documents in another language, has 
recently been one of the major topics within the 
information retrieval community. One strong 
motivation for CLIR is the growing number of 
documents in various languages accessible via 
the Internet. Since queries and documents are 
in different languages, CLIR requires a trans- 
lation phase along with the usual monolingual 
retrieval phase. For this purpose, existing CLIR
systems adopt various techniques explored in
natural language processing (NLP) research. In
brief, bilingual dictionaries, corpora, thesauri
and machine translation (MT) systems are used
to translate queries and/or documents.

In this paper, we propose a Japanese/English 
CLIR system for technical documents, focus- 
ing on translation of technical terms. Our 
purpose also includes integration of different 
components within one framework. Our research
is partly motivated by the "NACSIS"
test collection for IR systems (Kando et al.,
1998)^1, which consists of Japanese queries
and Japanese/English abstracts extracted from
technical papers (we will elaborate on the NACSIS
collection in Section 4). Using this collection,
we investigate the effectiveness of each
component as well as the overall performance of 
the system. 

As with MT systems, existing CLIR systems 
still find it difficult to translate technical terms 
and proper nouns, which are often unlisted in 
general dictionaries. Since most CLIR systems 
target newspaper articles, which are comprised 
mainly of general words, the problem related to 
unlisted words has been less explored than other 
CLIR subtopics (such as resolution of translation
ambiguity). However, Pirkola (1998), for
example, used a subset of the TREC collection 
related to health topics, and showed that com- 
bination of general and domain specific (i.e., 
medical) dictionaries improves the CLIR perfor- 
mance obtained with only a general dictionary. 
This result shows the potential contribution of 
technical term translation to CLIR. At the same 
time, note that even domain specific dictionaries
do not exhaustively list possible technical terms.
We classify problems associated with technical
term translation as given below:

(1) technical terms are often compound words,
which can be progressively created simply
by combining multiple existing morphemes
("base words"), and therefore it is not entirely
satisfactory to exhaustively enumerate
newly emerging terms in dictionaries,

(2) Asian languages often represent loanwords
based on their special phonograms (primarily
for technical terms and proper nouns),
which creates new base words progressively
(in the case of Japanese, the phonogram is
called katakana).

To counter problem (1), we use the compound
word translation method we proposed (Fujii
and Ishikawa, 1999), which selects appropriate
translations based on the probability of occurrence
of each combination of base words in
the target language. For problem (2), we use
"transliteration" (Chen et al., 1998; Knight
and Graehl, 1998; Wan and Verspoor, 1998).
Chen et al. (1998) and Wan and Verspoor (1998)
proposed English-Chinese transliteration methods
relying on the property of the Chinese
phonetic system, which cannot be directly applied
to transliteration between English and
Japanese. Knight and Graehl (1998) proposed a
Japanese-English transliteration method based
on the mapping probability between English
and Japanese katakana sounds. However, since
their method needs large-scale phoneme inventories,
we propose a simpler approach using
surface mapping between English and katakana
characters, rather than sounds.

Section 2 overviews our CLIR system, and
Section 3 elaborates on the translation module
focusing on compound word translation and
transliteration. Section 4 then evaluates the
effectiveness of our CLIR system by way of
the standardized IR evaluation method used in
TREC programs.

^1 http://www.rd.nacsis.ac.jp/~ntcadni/index-en.html

2 System Overview

Before explaining our CLIR system, we classify
existing CLIR into three approaches in
terms of the implementation of the translation
phase. The first approach translates queries
into the document language (Ballesteros and
Croft, 1998; Carbonell et al., 1997; Davis and
Ogden, 1997; Fujii and Ishikawa, 1999; Hull and
Grefenstette, 1996; Kando and Aizawa, 1998;
Okumura et al., 1998), while the second approach
translates documents into the query language
(Gachot et al., 1996; Oard and Hackett,
1997). The third approach transfers both
queries and documents into an interlingual
representation: bilingual thesaurus classes (Mongar,
1969; Salton, 1970; Sheridan and Ballerini,
1996) and language-independent vector space
models (Carbonell et al., 1997; Dumais et al.,
1996). We prefer the first approach, the "query
translation", to other approaches because (a)
translating all the documents in a given col- 
lection is expensive, (b) the use of thesauri re- 
quires manual construction or bilingual compa- 
rable corpora, (c) interlingual vector space mod- 
els also need comparable corpora, and (d) query 
translation can easily be combined with existing 
IR engines and thus the implementation cost is 
low. At the same time, we concede that other 
CLIR approaches are worth further exploration. 

Figure 1 depicts the overall design of our
CLIR system, where most components are the
same as those for monolingual IR, excluding
the "translator".

First, "tokenizer" processes "documents" in 
a given collection to produce an inverted file 
("surrogates"). Since our system is bidirec- 
tional, tokenization differs depending on the 
target language. In the case where documents 
are in English, tokenization involves eliminat- 
ing stopwords and identifying root forms for 
inflected words, for which we used "WordNet"
(Miller et al., 1993). On the other hand,
we segment Japanese documents into lexical
units using the "ChaSen" morphological analyzer
(Matsumoto et al., 1997) and discard stopwords.
In the current implementation, we use
word-based uni-gram indexing for both English
and Japanese documents. In other words, compound
words are decomposed into base words
in the surrogates. Note that indexing and retrieval
methods are theoretically independent of
the translation method.

Thereafter, the "translator" processes a query 
in the source language ("S-query") to output 
the translation ("T-query"). T-query can con- 
sist of more than one translation, because mul- 
tiple translations are often appropriate for a sin- 
gle technical term. 

Finally, the "IR engine" computes the sim- 
ilarity between T-query and each document 
in the surrogates based on the vector space 
model (Salton and McGill, 1983), and sorts documents
according to the similarity, in descending
order. We compute term weight based on the 
notion of TF-IDF. Note that T-query is decom- 
posed into base words, as performed in the doc- 
ument preprocessing. 
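
The retrieval step described above can be illustrated with a minimal sketch. It assumes a plain TF-IDF weighting with cosine similarity; the exact weighting formula is not given in the text, so the `+ 1.0` smoothing term, the function names and the toy documents are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF vectors for tokenized documents (lists of base words)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # The "+ 1.0" keeps terms that occur in every document from vanishing;
    # this smoothing is an assumption, not a detail taken from the paper.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, docs):
    """Return document indices sorted by similarity to the query, descending."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scores = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda x: (-x[0], x[1]))]


docs = [
    ["data", "mining", "method"],
    ["machine", "translation", "system"],
    ["data", "structure"],
]
ranking = rank(["data", "mining"], docs)
```

Because both the T-query and the documents are decomposed into base words, a query term such as "data" still contributes to documents that contain it inside a different compound.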

In Section 3, we will explain the "translator"
in Figure 1, which involves compound word
translation and transliteration modules.




Figure 1: The overall design of our CLIR system



3 Translation Module 
3.1 Overview 

Given a query in the source language, tokeniza- 
tion is first performed as for target documents 
(see Figure 1). To put it more precisely, we use
WordNet and ChaSen for English and Japanese 
queries, respectively. We then discard stop- 
words and extract only content words. Here, 
"content words" refer to both single and com- 
pound words. Let us take the following query 
as an example: 

improvement of data mining methods. 

For this query, we discard "of", to extract
"improvement" and "data mining methods".
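
The extraction in this example can be sketched as follows. This is a simplification under stated assumptions: the system itself relies on WordNet and ChaSen, whereas the sketch merely treats every run of non-stopwords as one content word, and the stopword list is a small hypothetical one.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"of", "the", "a", "an", "for", "in", "on", "and", "to"}


def content_words(query):
    """Split a query at stopwords; each remaining run is one content word."""
    chunks, current = [], []
    for token in query.lower().rstrip(".").split():
        if token in STOPWORDS:
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks


extracted = content_words("improvement of data mining methods.")
```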



Thereafter, we translate each extracted content
word individually. Note that we currently
do not consider relations (e.g., syntactic relations
and collocational information) between content
words. If a single word, such as "improvement"
in the example above, is listed in our bilingual
dictionary (we will explain the way to produce
the dictionary in Section 3.2), we use all possible
translation candidates as query terms for
the subsequent retrieval phase.

Otherwise, compound word translation is 
performed. In the case of Japanese-English 
translation, we consider all possible segmenta- 
tions of the input word, by consulting the dic- 
tionary. Then, we select such segmentations 
that consist of the minimal number of base 
words. During the segmentation process, the 
dictionary derives all possible translations for 
base words. At the same time, transliteration 
is performed whenever katakana sequences un- 
listed in the dictionary are found. On the other 
hand, in the case of English-Japanese transla- 
tion, transliteration is applied to any unlisted 
base word (including the case where the input 
English word consists of a single base word). Finally,
we compute the probability of occurrence
of each combination of base words in the target 
language, and select those with greater proba- 
bilities, for both Japanese-English and English- 
Japanese translations. 
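
The segmentation step for Japanese-English translation can be sketched with dynamic programming. The sketch below assumes the dictionary is a plain set of base-word strings and uses romanized input for readability; the function name is hypothetical, and the transliteration fallback for unlisted katakana sequences is omitted.

```python
def min_segmentations(word, dictionary):
    """Return every segmentation of `word` into dictionary entries
    that uses the minimal number of base words."""
    n = len(word)
    # best[i] holds (number of parts, segmentations) for the prefix word[:i].
    best = [None] * (n + 1)
    best[0] = (0, [[]])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece not in dictionary or best[start] is None:
                continue
            count = best[start][0] + 1
            segs = [seg + [piece] for seg in best[start][1]]
            if best[end] is None or count < best[end][0]:
                best[end] = (count, segs)
            elif count == best[end][0]:
                best[end][1].extend(segs)
    return best[n][1] if best[n] else []


# Hypothetical romanized dictionary entries.
base_words = {"tensou", "gengo", "ten", "sou"}
segmentations = min_segmentations("tensougengo", base_words)
```

Here "ten + sou + gengo" is also a valid segmentation, but it is dropped because "tensou + gengo" uses fewer base words.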

3.2 Compound Word Translation 

This section briefly explains the compound 
word translation method we previously proposed
(Fujii and Ishikawa, 1999). This method
translates input compound words on a word-by-
word basis, maintaining the word order in the
source language^2. The source compound word
and one translation candidate are represented
as below.

$$S = s_1, s_2, \ldots, s_n$$
$$T = t_1, t_2, \ldots, t_n$$

^2 A preliminary study showed that approximately 95%
of compound technical terms defined in a bilingual
dictionary maintain the same word order in both source and
target languages.



Here, $s_i$ and $t_i$ denote the i-th base words in
the source and target languages, respectively.
Our task, i.e., to select T which maximizes
$P(T|S)$, is transformed into Equation (1)
through use of Bayes' theorem, since $P(S)$ is
constant for a given S.

$$\arg\max_T P(T|S) = \arg\max_T P(S|T) \cdot P(T) \qquad (1)$$

$P(S|T)$ and $P(T)$ are approximated as in
Equation (2), which has commonly been used in
recent statistical NLP research (Church and
Mercer, 1993).

$$P(S|T) \approx \prod_{i=1}^{n} P(s_i|t_i), \qquad P(T) \approx \prod_{i=1}^{n-1} P(t_{i+1}|t_i) \qquad (2)$$
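
As a rough illustration of Equations (1) and (2), the sketch below enumerates the candidate combinations and ranks them by log probability. All probability values are invented for the example, and the `floor` value for unseen bi-grams is an assumed smoothing choice that the text does not specify; the function and parameter names are likewise hypothetical.

```python
import math
from itertools import product


def k_best_translations(source_words, trans_prob, bigram_prob, k=3, floor=1e-6):
    """Rank candidate translations T by log P(S|T) + log P(T).

    trans_prob[s][t] stands for P(s|t); bigram_prob[(t1, t2)] for P(t2|t1).
    """
    candidate_lists = [list(trans_prob[s]) for s in source_words]
    scored = []
    for cand in product(*candidate_lists):
        # log P(S|T): product over aligned base word pairs.
        logp = sum(math.log(trans_prob[s][t]) for s, t in zip(source_words, cand))
        # log P(T): product over adjacent target word pairs.
        logp += sum(
            math.log(bigram_prob.get(pair, floor)) for pair in zip(cand, cand[1:])
        )
        scored.append((logp, cand))
    scored.sort(key=lambda item: -item[0])
    return [cand for _, cand in scored[:k]]


# Invented probabilities, for illustration only.
trans_prob = {
    "re-ji-su-ta": {"resister": 0.4, "resistor": 0.3, "register": 0.3},
    "tensou": {"transfer": 1.0},
    "gengo": {"language": 1.0},
}
bigram_prob = {("register", "transfer"): 0.2, ("transfer", "language"): 0.3}
best = k_best_translations(["re-ji-su-ta", "tensou", "gengo"], trans_prob, bigram_prob)
```

Working in log space avoids underflow when many small probabilities are multiplied, and the bi-gram term is what lets the context prefer "register" over candidates with a higher translation score.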

We produced our own dictionary, because
conventional dictionaries are comprised primarily
of general words and verbose definitions
aimed at human readers. We extracted 59,533
English/Japanese translations consisting of two
base words from the EDR technical terminology
dictionary, which contains about 120,000
translations related to the information processing
field (Japan Electronic Dictionary Research
Institute, 1995), and segmented Japanese entries
into two parts^3. For this purpose, simple heuristic
rules based mainly on Japanese character
types (i.e., kanji, katakana, hiragana, alphabets
and other characters like numerals) were
used. Given the set of compound words where
Japanese entries are segmented, we align
English-Japanese base words on a word-by-word
basis, maintaining the word order between English
and Japanese, to produce a Japanese-
English/English-Japanese base word dictionary.
As a result, we extracted 24,439 Japanese base
words and 7,910 English base words from the
EDR dictionary. During the dictionary production,
we also count the collocational frequency
for each combination of $s_i$ and $t_i$, in order to
estimate $P(s_i|t_i)$. Note that in the case where
$s_i$ is transliterated into $t_i$, we use an arbitrarily
predefined value for $P(s_i|t_i)$. For the estimation
of $P(t_{i+1}|t_i)$, we use the word-based bi-gram
statistics obtained from target language
corpora, i.e., "documents" in the collection (see
Figure 1).

^3 The number of base words can easily be identified
based on English words, while Japanese compound words
lack lexical segmentation.
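
The heuristic rules themselves are not listed in the text. As one plausible reading, the sketch below splits a Japanese entry wherever the character type changes, using Unicode character names to detect the type. The function names are hypothetical, and a real rule set would need extra handling for entries whose two base words share a character type.

```python
import unicodedata
from itertools import groupby


def char_type(ch):
    """Classify a character as kanji, katakana, hiragana, alphabet or other."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("KATAKANA"):  # also covers the prolonged sound mark
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def split_by_char_type(entry):
    """Split an entry at every boundary between character types."""
    return ["".join(group) for _, group in groupby(entry, key=char_type)]


parts = split_by_char_type("データ構造")
```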

3.3 Transliteration 

Figure 2 shows example correspondences between
English and (romanized) katakana words,
where we insert hyphens between each katakana
character for enhanced readability. The basis
of our transliteration method is analogous to
that for compound word translation described
in Section 3.2. The source word and one
transliteration candidate are represented as below.

$$S = s_1, s_2, \ldots, s_n$$
$$T = t_1, t_2, \ldots, t_n$$

However, unlike the case of compound word
translation, $s_i$ and $t_i$ denote the i-th "symbols"
(which consist of one or more letters) in the
source and target words, respectively. Note that
we consider only such T's that are indexed in
the inverted file, because our transliteration
method often outputs a number of incorrect words
with great probabilities. Then, we compute
$P(T|S)$ for each T using Equations (1) and (2)
(see Section 3.2), and select k-best candidates
with greater probabilities. The crucial issue
here is the way to produce a bilingual dictionary
for symbols. For this purpose, we used
approximately 3,000 katakana entries and their
English translations listed in our base word
dictionary. To illustrate our dictionary production
method, we consider Figure 2 again. Looking at
this figure, one may notice that the first letter
in each katakana character tends to be contained
in its corresponding English word. However,
there are a few exceptions. A typical case is
that since Japanese has no distinction between
"L" and "R" sounds, the two English sounds
collapse into the same Japanese sound. In addition,
a single English letter can correspond to
multiple katakana characters, such as "x" to
"ki-su" in "<text, te-ki-su-to>". To sum up,
English and romanized katakana words are not
exactly identical, but similar to each other.



English      katakana
system       shi-su-te-mu
mining       ma-i-ni-n-gu
data         dee-ta
network      ne-tto-waa-ku
text         te-ki-su-to
collocation  ko-ro-ke-i-sho-n

Figure 2: Examples of English-katakana
correspondence



We first manually define the similarity between
the English letter e and the first romanized
letter for each katakana character j, as
shown in Table 1. In this table, "phonetically
similar" letters refer to a certain pair of letters,
such as "L" and "R"^4. We then consider the
similarity for any possible combination of letters
in English and romanized katakana words,
which can be represented as a matrix, as shown
in Figure 3. This figure shows the similarity
between letters in "<text, te-ki-su-to>". We
put a dummy letter "$", which has a positive
similarity only to itself, at the end of both
English and katakana words. One may notice that
matching plausible symbols can be seen as finding
the path which maximizes the total similarity
from the first to last letters. The best path
can easily be found by, for example, Dijkstra's
algorithm (Dijkstra, 1959). From Figure 3,
we can derive the following correspondences:
"<te, te>", "<x, ki-su>" and "<t, to>". The
resultant correspondences contain 944 Japanese
and 790 English symbol types, from which we
also estimated $P(s_i|t_i)$ and $P(t_{i+1}|t_i)$.
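
The similarity definition in Table 1 and the construction of the matrix can be sketched as below. Only the "L"/"R" pair is named in the text, so the set of phonetically similar pairs here contains just that pair; with the authors' full set of pairs, some cells would differ. The path search itself is not sketched, and the function names are hypothetical.

```python
VOWELS = set("aeiou")
# Only the L/R pair is given in the text; the remaining pairs are unknown.
SIMILAR_PAIRS = {frozenset("lr")}


def similarity(e, j):
    """Similarity between an English letter e and a romanized letter j."""
    e, j = e.lower(), j.lower()
    if e == j:
        return 3  # identical (this also covers the dummy letter "$")
    if frozenset((e, j)) in SIMILAR_PAIRS:
        return 2  # phonetically similar
    if e.isalpha() and j.isalpha() and ((e in VOWELS) == (j in VOWELS)):
        return 1  # both vowels or both consonants
    return 0


def similarity_matrix(english, katakana):
    """Rows: English letters plus "$"; columns: first romanized letter of
    each katakana character plus "$"."""
    rows = list(english) + ["$"]
    cols = [k[0] for k in katakana] + ["$"]
    return [[similarity(e, j) for j in cols] for e in rows]


matrix = similarity_matrix("text", ["te", "ki", "su", "to"])
```

Since "$" is not alphabetic, it scores zero against every real letter and a positive value only against itself, which pins both ends of any path through the matrix.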

As can be predicted, a preliminary experiment
showed that our transliteration method
is not accurate when compared with a word-
based translation. For example, the Japanese
word "re-ji-su-ta (register)" is transliterated to
"resister", "resistor" and "register", with the
probability score in descending order. However,
combined with the compound word translation,
irrelevant transliteration outputs are expected
to be discarded. For example, a compound
word like "re-ji-su-ta tensou gengo (register
transfer language)" is successfully translated,
given a set of base words "tensou (transfer)"
and "gengo (language)" as a context.



Table 1: The similarity between English and
Japanese letters

condition                              similarity
e and j are identical                  3
e and j are phonetically similar       2
both e and j are vowels or consonants  1
otherwise                              0








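[Matrix omitted: rows are the English letters t, e, x, t, $;
columns are the katakana characters te, ki, su, to, $;
cells hold the similarities defined in Table 1.]

Figure 3: An example matrix for English-
Japanese symbol matching (arrows denote the
best path)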

4 Evaluation 

This section investigates the performance of our 
CLIR system based on the TREC-type evalu- 
ation methodology: the system outputs 1,000 
top documents, and TREC evaluation software 
is used to calculate the recall-precision trade-off
and 11-point average precision. 
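
For readers unfamiliar with the measure, the following sketch computes 11-point average precision for a single query using the standard interpolated definition. It is not the TREC evaluation software, and the function name and toy ranking are assumptions made for the example.

```python
def eleven_point_average_precision(relevance, total_relevant):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0.

    relevance: booleans for the ranked output, best-ranked document first.
    total_relevant: number of relevant documents for the query.
    """
    points = []  # (recall, precision) at each rank where a relevant doc appears
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision at a level: best precision at any recall >= level.
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0) for level in levels
    ]
    return sum(interpolated) / len(levels)


score = eleven_point_average_precision([True, False, True, False], total_relevant=2)
```

In this toy ranking, precision is 1 up to recall 0.5 and 2/3 beyond it, so the average is (6 x 1 + 5 x 2/3) / 11 = 28/33.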

^4 We identified approximately twenty pairs of phonetically
similar letters.

For the purpose of our evaluation, we used
the NACSIS test collection (Kando et al., 1998).
This collection consists of 21 Japanese queries
and approximately 330,000 documents (in either
a combination of English and Japanese or
either of the languages individually), collected
from technical papers published by 65 Japanese
associations for various fields. Each document
consists of the document ID, title, name(s) of
author(s), name/date of conference, hosting
organization, abstract and keywords, from which
titles, abstracts and keywords were used for our
evaluation. We used as target documents
approximately 187,000 entries where abstracts are
in both English and Japanese. Each query consists
of the title of the topic, description, narrative
and list of synonyms, from which we used
only the description. Roughly speaking, most
topics are related to electronic, information and
control engineering. Figure 4 shows example
descriptions (translated into English by one of the
authors). Relevance assessment was performed
based on one of the three ranks of relevance,
i.e., "relevant", "partially relevant" and
"irrelevant". In our evaluation, relevant documents
refer to both "relevant" and "partially relevant"
documents^5.



ID    description
0005  dimension reduction for clustering
0006  intelligent information retrieval
0019  syntactic analysis methods for Japanese
0024  machine translation systems

Figure 4: Example descriptions in the NACSIS
query



4.1 Evaluation of compound word translation

We compared the following query translation 
methods: 

(1) a control, in which all possible translations
derived from the (original) EDR technical
terminology dictionary are used as query
terms ("EDR"),

(2) all possible base word translations derived
from our dictionary are used ("all"),

(3) randomly selected k translations derived
from our bilingual dictionary are used
("random"),

(4) k-best translations through compound
word translation are used ("CWT").

^5 The result did not significantly change depending on
whether we regarded "partially relevant" as relevant or
not.

For system "EDR", compound words unlisted in
the EDR dictionary were manually segmented
so that substrings (shorter compound words or
base words) can be translated. For both systems
"random" and "CWT", we arbitrarily set
k = 3. Figure 5 and Table 2 show the recall-
precision curve and 11-point average precision
for each method, respectively. In these, "J-J"
refers to the result obtained by the Japanese-
Japanese IR system, which uses as documents
Japanese titles/abstracts/keywords comparable
to English fields in the NACSIS collection. This
can be seen as the upper bound for CLIR
performance^6. Looking at these results, we can
conclude that the dictionary production and
probabilistic translation methods we proposed are
effective for CLIR.




Figure 5: Recall-Precision curves for evaluation
of compound word translation



^6 Regrettably, since the NACSIS collection does not
contain English queries, we cannot estimate the upper
bound performance by English-English IR.



Table 2: Comparison of average precision for
evaluation of compound word translation

        avg. precision  ratio to J-J
J-J     0.204           -
CWT     0.193           0.946
all     0.171           0.838
EDR     0.130           0.637
random  0.116           0.569
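
The "ratio to J-J" column is simply each method's average precision divided by the J-J value. A short check using the figures reported in Table 2 (the variable names are chosen only for this example):

```python
J_J = 0.204  # average precision of the Japanese-Japanese run (Table 2)
avg_precision = {"CWT": 0.193, "all": 0.171, "EDR": 0.130, "random": 0.116}

# Ratio of each method's average precision to the monolingual baseline.
ratios = {name: round(p / J_J, 3) for name, p in avg_precision.items()}
```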



4.2 Evaluation of transliteration 

In the NACSIS collection, three queries contain
katakana (base) words unlisted in our bilingual
dictionary. Those words are "ma-i-ni-n-gu
(mining)" and "ko-ro-ke-i-sho-n (collocation)".
However, to emphasize the effectiveness
of transliteration, we compared the following
extreme cases:

(1) a control, in which every katakana word is
discarded from queries ("control"),

(2) a case where transliteration is applied to
every katakana word and top 10 candidates
are used ("translit").

Both cases use system "CWT" in Section 4.1.
In the case of "translit", we do not use katakana
entries listed in the base word dictionary.
Figure 6 and Table 3 show the recall-precision curve
and 11-point average precision for each case,
respectively. In these, results for "CWT" correspond
to those in Figure 5 and Table 2, respectively.
We can conclude that our transliteration
method significantly improves the baseline
performance (i.e., "control"), and is comparable to
word-based translation in terms of CLIR
performance.

An interesting observation is that the use of
transliteration is robust against typos in documents,
because a number of similar strings are
used as query terms. For example, our transliteration
method produced the following strings
for "ri-da-ku-sho-n (reduction)":

riduction, redction, redaction, reduction.

All of these words are effective for retrieval,
because they are contained in the target documents.



[Plot omitted: recall-precision curves for J-J, CWT,
translit and control; x-axis: recall.]

Figure 6: Recall-Precision curves for evaluation
of transliteration



Table 3: Comparison of average precision for
evaluation of transliteration

         avg. precision  ratio to J-J
J-J      0.204           -
CWT      0.193           0.946
translit 0.193           0.946
control  0.115           0.564



4.3 Evaluation of the overall performance

We compared our system ("CWT+translit")
with the Japanese-Japanese IR system, where
(unlike the evaluation in Section 4.2) transliteration
was applied only to "ma-i-ni-n-gu (mining)"
and "ko-ro-ke-i-sho-n (collocation)". Figure 7
and Table 4 show the recall-precision curve
and 11-point average precision for each system,
respectively, from which one can see that
our CLIR system is quite comparable with the
monolingual IR system in performance. In addition,
from Figures 5 to 7, one can see that the
monolingual system generally performs better
at lower recall while the CLIR system performs
better at higher recall.

For further investigation, let us discuss similar
experimental results reported by Kando
and Aizawa (1998), where a bilingual dictionary
produced from Japanese/English keyword pairs
in the NACSIS documents is used for query
translation. Their evaluation method is almost
the same as performed in our experiments.
One difference is that they use the "OpenText"
search engine^7, and thus the performance for
Japanese-Japanese IR is higher than obtained
in our evaluation. However, the performance
of their Japanese-English CLIR systems, which
is roughly 50-60% of that for their Japanese-
Japanese IR system, is comparable with our
CLIR system performance. It is expected that
using a more sophisticated search engine, our
CLIR system will achieve a higher performance
than that obtained by Kando and Aizawa.




Figure 7: Recall-Precision curves for evaluation
of overall performance

Table 4: Comparison of average precision for
evaluation of overall performance

                avg. precision  ratio to J-J
J-J             0.204           -
CWT + translit  0.212           1.04

^7 Developed by OpenText Corp.

5 Conclusion

In this paper, we proposed a Japanese/English
cross-language information retrieval system,
targeting technical documents. We combined
a query translation module, which performs
compound word translation and transliteration,
with an existing monolingual retrieval
method. Our experimental results showed that
compound word translation and transliteration
methods individually improve on the baseline
performance, and when used together the
improvement is even greater. Future work will
include the application of automatic word alignment
methods (Fung, 1995; Smadja et al., 1996)
to enhance the dictionary.